Day 1: Introduction
2023-07-24
In this session, you learn how to use the tools of the hunt. We will:
- git version control with GitHub (you will get the material for the course in this step).
- R: R Refresher

Image credit: Woody Kelly via unsplash.com
While graphical user interfaces (GUIs) can be more visually intuitive and user-friendly, the command-line interface (terminal):
- You use R because you don’t want to memorise the same 15 clicks in Excel and repeat them again and again. Likewise, you can save terminal commands for things you do regularly.

Package managers:
- homebrew (macOS)
- chocolatey or scoop (Windows)

Install git (and, if you are on Windows, unxutils) through a package manager.

What is git and why should you use it

Git is a version control system (VCS) that helps keep track of changes made to files and directories in a project. It allows you to revert to previous versions, compare changes over time, and collaborate with others on the same project without overwriting each other’s work.
You can use git to track your work by setting up a repository, often called a “repo”. Here’s how:
- Initialise a new git repository by running the command git init. This creates a new subdirectory named .git that contains all necessary git metadata.

Committing is the process by which you save changes to the repository.

- To tell git to start tracking changes in specific files, you need to add them to the repository with git add filename. If you want to add all files in the directory, you can use git add . (the trailing dot is part of the command).
- Save the staged changes with git commit -m "Your commit message". The message should be a brief description of the changes made.
- Check the state of your repository with git status.

If you make a mistake, or simply want to go back to an earlier version of your project, you can use git checkout [commit hash].

- Find the commit hash with git log.
- Use git checkout [commit hash] to go back to an earlier state of your repository.
- Return to the latest state with git checkout master.

If you want to undo the changes made in one commit, use git revert.

- Find the commit hash with git log.
- Use git revert --no-commit [commit hash] to undo the changes made in that commit without committing the result yet.
- Record the result with git commit -m "Your commit message".

Branching in git allows you to create a separate version of your project to develop and test new features without affecting the main branch.

- Create a new branch with git branch branch-name.
- Switch to it with git checkout branch-name.
- Switch back to the main branch with git checkout master.

Rebasing is a way to integrate changes from one branch into another. This matters when you want to merge a branch into the main branch, but the main branch has received changes in the meantime that you want to integrate first.

- Switch to the branch you want to update with git checkout branch-name.
- Run git rebase other-branch-name to integrate changes from other-branch-name.

Merging is the process of integrating changes from one branch into another.

- Switch to the branch that should receive the changes with git checkout branch-name.
- Merge the other branch into it with git merge other-branch-name.

A fork is a copy of a repository that allows you to freely experiment with changes without affecting the original project. Forking is commonly used in open source projects to propose changes to someone else’s project, or to use someone else’s project as a starting point for your own work.
A pull request is a way to propose changes from your fork or branch to the original repository. It’s how you contribute to open source projects on platforms like GitHub.
- git clone https://github.com/JBGruber/ess-web-scraping.git
- git push to upload the changes to your fork

R Refresher

R organises its functions in packages (even base functions).

If you do not want to attach an entire package, you can use the double colon (::) to use only a specific function:
Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
6 5.4
7 4.6
8 5.0
9 4.4
10 4.9
11 5.4
12 4.8
13 4.8
14 4.3
15 5.8
16 5.7
17 5.4
18 5.1
19 5.7
20 5.1
21 5.4
22 5.1
23 4.6
24 5.1
25 4.8
26 5.0
27 5.0
28 5.2
29 5.2
30 4.7
31 4.8
32 5.4
33 5.2
34 5.5
35 4.9
36 5.0
37 5.5
38 4.9
39 4.4
40 5.1
41 5.0
42 4.5
43 4.4
44 5.0
45 5.1
46 4.8
47 5.1
48 4.6
49 5.3
50 5.0
51 7.0
52 6.4
53 6.9
54 5.5
55 6.5
56 5.7
57 6.3
58 4.9
59 6.6
60 5.2
61 5.0
62 5.9
63 6.0
64 6.1
65 5.6
66 6.7
67 5.6
68 5.8
69 6.2
70 5.6
71 5.9
72 6.1
73 6.3
74 6.1
75 6.4
76 6.6
77 6.8
78 6.7
79 6.0
80 5.7
81 5.5
82 5.5
83 5.8
84 6.0
85 5.4
86 6.0
87 6.7
88 6.3
89 5.6
90 5.5
91 5.5
92 6.1
93 5.8
94 5.0
95 5.6
96 5.7
97 5.7
98 6.2
99 5.1
100 5.7
101 6.3
102 5.8
103 7.1
104 6.3
105 6.5
106 7.6
107 4.9
108 7.3
109 6.7
110 7.2
111 6.5
112 6.4
113 6.8
114 5.7
115 5.8
116 6.4
117 6.5
118 7.7
119 7.7
120 6.0
121 6.9
122 5.6
123 7.7
124 6.3
125 6.7
126 7.2
127 6.2
128 6.1
129 6.4
130 7.2
131 7.4
132 7.9
133 6.4
134 6.3
135 6.1
136 7.7
137 6.3
138 6.4
139 6.0
140 6.9
141 6.7
142 6.9
143 5.8
144 6.8
145 6.7
146 6.7
147 6.3
148 6.5
149 6.2
150 5.9
Less often used, you can also do this with library:
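As a small sketch of both variants, using the tools package that ships with R. This assumes the sentence above refers to the include.only argument of library(); the file names are made up for illustration.

```r
# Call a single function without attaching its package
ext <- tools::file_ext("report.pdf")
ext

# Attach only the named function instead of the whole package (needs R >= 3.6.0)
library(tools, include.only = "file_ext")
file_ext("data.csv")
```

After the library() call, file_ext() is available without the tools:: prefix, while the other functions in tools stay unattached.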
R packages

One of the most important commands in R is ?, though:
All help files in R follow the same structure and principle (although not all help files contain all elements):
- ... (called ellipsis or dots), which is passed on to the underlying function.

Functions are easy to define in R:
[1] 1 1 1
[1] 55.0 5.5 5.5
Going through this bit by bit:
- the function name (avoid . or CamelCase, but use _ if you have multiple words)
- the return value via return() (can be implicit).

Iterate over a vector:
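The original chunk is not reproduced here. A minimal sketch of a for loop, using a small helper function defined only for illustration (double_it, values and results are hypothetical names, not from the course material):

```r
# Define a helper; the last evaluated expression is returned implicitly
double_it <- function(x) {
  x * 2
}

values <- c(1, 5, 10)

# Pre-allocate the result vector, then fill it one element per run
results <- numeric(length(values))
for (i in seq_along(values)) {
  results[i] <- double_it(values[i])
}
results
```

Looping over seq_along(values) rather than over the values themselves makes it easy to store each result at the matching position.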
- i takes a different value each run

Apply a function to each element of a vector/list:
foo <- function(i, silent = FALSE) {
  if (!silent) {
    message(i)
  }
  return(i)
}
x <- lapply(1:10, foo)
unlist(x)
 [1]  1  2  3  4  5  6  7  8  9 10
Also apply function to each element of a vector/list, but coerce types:
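The original example is not shown and may have used a different function. One base R way to do this is vapply(), sketched here with made-up input:

```r
x <- c("a", "bb", "ccc")

# lapply() always returns a list
as_list <- lapply(x, nchar)

# vapply() returns an atomic vector and checks every result against
# the template given in FUN.VALUE (here: a single integer)
as_int <- vapply(x, nchar, FUN.VALUE = integer(1), USE.NAMES = FALSE)
as_int
```

Because the expected type is stated up front, vapply() fails loudly if a result does not match, which is safer than letting sapply() guess the output type.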
if can be used to conditionally run code:
Any code that evaluates to a logical (TRUE/FALSE) can be used:
You can extend this with else, which is executed when the original condition is FALSE:
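A minimal sketch of if and else together (check_size and the threshold of 10 are made up for illustration):

```r
check_size <- function(n) {
  if (n > 10) {
    # runs when the condition is TRUE
    "big"
  } else {
    # runs when the condition is FALSE
    "small"
  }
}

check_size(3)
check_size(42)
```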
Commonly, people referring to base R mean all functions available when starting R without loading any packages with library(package).
4 6 8
11 7 14
[1] 198
[1] 6.1875
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
Mazda RX4 Wag 0.6153251
Datsun 710 54.9086059 54.8915169
Hornet 4 Drive 98.1125212 98.0958939 150.9935191
Hornet Sportabout 210.3374396 210.3358546 265.0831615 121.0297564
Valiant 65.4717710 65.4392224 117.7547018 33.5508692
Hornet Sportabout
Mazda RX4 Wag
Datsun 710
Hornet 4 Drive
Hornet Sportabout
Valiant 152.1241352
[1] "mazda rx4" "mazda rx4 wag" "datsun 710"
[4] "hornet 4 drive" "hornet sportabout" "valiant"
[7] "duster 360" "merc 240d" "merc 230"
[10] "merc 280" "merc 280c" "merc 450se"
[13] "merc 450sl" "merc 450slc" "cadillac fleetwood"
[16] "lincoln continental" "chrysler imperial" "fiat 128"
[19] "honda civic" "toyota corolla" "toyota corona"
[22] "dodge challenger" "amc javelin" "camaro z28"
[25] "pontiac firebird" "fiat x1-9" "porsche 914-2"
[28] "lotus europa" "ford pantera l" "ferrari dino"
[31] "maserati bora" "volvo 142e"
Especially for simple operations and statistics, base is still great.
Call:
lm(formula = hp ~ mpg, data = df)
Residuals:
Min 1Q Median 3Q Max
-59.26 -28.93 -13.45 25.65 143.36
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 324.08 27.43 11.813 8.25e-13 ***
mpg -8.83 1.31 -6.742 1.79e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 43.95 on 30 degrees of freedom
Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
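The call that produced the summary above is not shown. A sketch that should reproduce it, assuming df is the built-in mtcars data set (the car names above and the 30 residual degrees of freedom suggest this, but it is an assumption):

```r
df <- mtcars  # assumption: df in the Call above is the built-in mtcars data

# Regress horsepower on miles per gallon
fit <- lm(hp ~ mpg, data = df)
summary(fit)

# Extract the estimated slope for mpg
slope <- coef(fit)[["mpg"]]
slope
```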
base R also has a plotting system:
- data.frames are very dominant
- the pipe %>%, now native in R as |>

transform(aggregate(. ~ cyl, data = subset(mtcars, hp > 100), FUN = function(x) round(mean(x, 2))), kpl = mpg * 0.4251)
  cyl mpg disp  hp drat wt qsec vs am gear carb     kpl
1   4  26  108 111    4  2   18  1  1    4    2 11.0526
2   6  20  168 110    4  3   18  1  0    4    4  8.5020
3   8  15  350 192    3  4   17  0  0    3    4  6.3765
You can make this more readable by creating intermediate objects:
data1 <- subset(mtcars, hp > 100) # take subset of original data
# aggregate by cylinder; note that mean(x, 2) passes 2 as the trim argument, which makes
# mean() return the median, so these are rounded medians (round(mean(x), 2) would give rounded means)
data2 <- aggregate(. ~ cyl, data = data1, FUN = function(x) round(mean(x, 2)))
transform(data2, kpl = mpg * 0.4251) # convert miles per gallon to kilometres per litre
  cyl mpg disp  hp drat wt qsec vs am gear carb     kpl
1   4  26  108 111    4  2   18  1  1    4    2 11.0526
2   6  20  168 110    4  3   18  1  0    4    4  8.5020
3   8  15  350 192    3  4   17  0  0    3    4  6.3765
Or you use the pipe:
# the _ placeholder requires R >= 4.2
subset(mtcars, hp > 100) |>
  aggregate(. ~ cyl, data = _, FUN = function(x) round(mean(x, 2))) |>
  transform(kpl = mpg * 0.4251)
  cyl mpg disp  hp drat wt qsec vs am gear carb     kpl
1   4  26  108 111    4  2   18  1  1    4    2 11.0526
2   6  20  168 110    4  3   18  1  0    4    4  8.5020
3   8  15  350 192    3  4   17  0  0    3    4  6.3765
tidyverse functions are written with pipes in mind and are named as verbs with the goal to tell you exactly what they do:
library(tidyverse)
mtcars |>
  filter(hp > 100) |>
  group_by(cyl) |>
  summarise(across(.cols = everything(), .fns = function(x) x |> mean() |> round(2))) |>
  mutate(kpl = mpg * 0.4251)
# A tibble: 3 × 12
    cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb   kpl
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     4  25.9  108.  111   3.94  2.15  17.8  1     1     4.5   2    11.0
2     6  19.7  183.  122.  3.59  3.12  18.0  0.57  0.43  3.86  3.43  8.39
3     8  15.1  353.  209.  3.23  4     16.8  0     0.14  3.29  3.5   6.42
Note: You can interject the View() command at any line in a complicated pipeline to see the intermediate result in a spreadsheet-style data viewer.
ggplot2

- pb_collect() from paperboy: what do the arguments ignore_fails and connections do?
- object.size().

file_link <- "https://raw.githubusercontent.com/shawn-y-sun/Customer_Analytics_Retail/main/purchase%20data.csv"
df <- read.csv(file_link)
filtered_df <- df[df$Age >= 50, ]
aggregated_df <- aggregate(filtered_df$Quantity, by = list(filtered_df$Day), FUN = sum)
names(aggregated_df) <- c("day", "total_quantity")
aggregated_df[order(aggregated_df$total_quantity, decreasing = TRUE)[1:5], ]
    day total_quantity
162 162             73
460 460             73
123 123             61
183 183             60
340 340             57
“The language in which we express our ideas has a strong influence on our thought processes.”
― Donald Ervin Knuth, Literate Programming
This is where literate programming has a lot of advantages:
Quarto (and its predecessor R Markdown) were designed to make it easy for you to make the most of these advantages. We have already been using these tools throughout the workshop and I hope this made you more familiar with them.
Use the function report_template() from my package jbgtemplates to start a new report
Add some simple analysis in it and render
Create a new Quarto document and use the following YAML header to start your research abstract:
---
title: "Your Research Title"
subtitle: "Abstract Introduction to Web Scraping and Data Management for Social Scientists"
author: Your Name
date: today
format: pdf
---
Why not just switch to Python?
- You already know R, so why re-learn things from scratch?

Why not just stick with R then?
Before you load reticulate for the first time, we need to create a virtual environment (and potentially install a version of Python). This is a folder in your project directory with a link to Python and the packages you want to use in this project. Why?
- (R users may know this idea from the renv package.)

The first step is to check if Python is available already and to find where it is located on your system:
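The original chunk is not shown (a later note suggests it called whereis through system()). A portable sketch in base R, assuming the executable is called python or python3 on your system:

```r
# Sys.which() returns the full path of each executable,
# or an empty string if it is not found on the PATH
python_candidates <- Sys.which(c("python", "python3"))
python_candidates

# Keep only the executables that were actually found
found <- python_candidates[nzchar(python_candidates)]
found
```

An empty result means no Python is on the PATH, in which case you need to install one first.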
The easiest way to install Python for R projects is through reticulate (this regularly causes issues, though, so consider using your package manager instead):
Note, however, that your user name cannot contain spaces or special characters. If it does, you should install miniconda in a different location than the default, for example with reticulate::install_miniconda(path = "C:/tools/miniconda") (you need to create the folder C:/tools manually). Also note that system("whereis python") will not pick up this installation. Instead you can find the path using:
To do this, you first have to indicate the location where your Python executable lives (this path should always end in /bin/python or python.exe on Windows):
Then we can create a new virtual environment in the project folder:
# I built in this if condition to not accidentally overwrite the environment when rerunning the notebook
if (!reticulate::virtualenv_exists(envname = "../python-env/")) {
  reticulate::virtualenv_create("../python-env/", python = python_location)
}
reticulate::virtualenv_exists(envname = "../python-env/")
[1] TRUE
if (R.Version()$os == "mingw32") {
  python_path <- "../python-env/Scripts/python.exe"
} else {
  python_path <- "../python-env/bin/python"
}
python_path
[1] "../python-env/bin/python"
[1] TRUE
We can write this to your .Renviron file (otherwise the Sys.setenv() line above needs to be in every script). Note: the variables in the .Renviron file are set when R is started.
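The chunk that writes the file is not shown. A minimal sketch of the idea, which deliberately writes to a temporary file instead of your real .Renviron (the temporary location is an assumption made so the example is safe to run):

```r
python_path <- "../python-env/bin/python"  # path from the step above

# Assumption: a temporary file stands in for the real .Renviron,
# so running this sketch does not touch your actual settings
renviron <- file.path(tempdir(), ".Renviron")

# One NAME=value pair per line; in practice an absolute path is safer
entry <- paste0("RETICULATE_PYTHON=", python_path)
cat(entry, "\n", sep = "", file = renviron, append = TRUE)

readLines(renviron)
```

To make the setting permanent you would point renviron at the .Renviron file in your project or home directory and restart R.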
The file should look something like this:
RETICULATE_PYTHON=/home/johannes/Dropbox/Teaching/ess-introduction-to-web-scraping/python-env/bin/python
reticulate and See if it is Working

python: /home/johannes/Dropbox/Teaching/ess-introduction-to-web-scraping/python-env/bin/python
libpython: /usr/lib/libpython3.11.so
pythonhome: /home/johannes/Dropbox/Teaching/ess-introduction-to-web-scraping/python-env:/home/johannes/Dropbox/Teaching/ess-introduction-to-web-scraping/python-env
version: 3.11.3 (main, Jun 5 2023, 09:32:32) [GCC 13.1.1 20230429]
numpy: /home/johannes/Dropbox/Teaching/ess-introduction-to-web-scraping/python-env/lib/python3.11/site-packages/numpy
numpy_version: 1.25.1
NOTE: Python version was forced by RETICULATE_PYTHON
reticulate::py_install() installs packages, similar to install.packages(). Let’s install the packages we need:
But there are some caveats:
- the install name and the import name of a package can differ (e.g., scikit-learn is imported as sklearn)
- some packages come in several variants (bertopic, bertopic[gensim], bertopic[spacy])

If you see the $ in the beginning, these are command line/bash commands. Use the ```{bash} chunk option to run these commands and use the pip and python versions in your virtual environment (you could also activate the environment instead).
```{bash}
#| eval: false
./python-env/bin/pip install -U pip setuptools wheel
./python-env/bin/pip install -U 'spacy'
./python-env/bin/python -m spacy download en_core_web_sm
./python-env/bin/python -m spacy download de_core_news_sm
```
On Windows, the binary files are in a different location:
General tip: see if the software distributor has instructions, like the excellent ones from spacy:
In my opinion, a nice workflow is to use R and Python together in a Quarto Document. All you need to do to tell Quarto to run a Python chunk instead of an R chunk is to replace ```{r} with ```{python}.
[1] "Hello World! From R"
Hello World! From Python
You can even set up a shortcut to make these chunks (I like Ctrl+Alt+P):
To get an interactive Python session in your Console, you can use reticulate::repl_python().
As you’ve seen above, the code is pretty similar, with a few key differences:
- Python uses = instead of <- for assignment
- there is no data.frame class; instead you have dictionaries or the DataFrame from the Pandas package
- the *apply family of functions and vectorised code do not exist as such; everything is a for loop!

can only concatenate list (not "int") to list
3
4
5
6
7
8
9
10
11
12
my_dict = {'name': ['John', 'Jane', 'Jim', 'Joan'],
           'age': [32, 28, 40, 35],
           'city': ['New York', 'London', 'Paris', 'Berlin']}
my_dict
{'name': ['John', 'Jane', 'Jim', 'Joan'], 'age': [32, 28, 40, 35], 'city': ['New York', 'London', 'Paris', 'Berlin']}
reticulate Magic

The truly magical thing about reticulate is how seamlessly it hands objects back and forth between Python and R:
[1] "Hello World! From Python"
[1] 1 2 3 4 5 6 7 8 9 10
$name
[1] "John" "Jane" "Jim" "Joan"
$age
[1] 32 28 40 35
$city
[1] "New York" "London" "Paris" "Berlin"
'Hello World! From R'
{'num': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'let': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']}
{'df': {'num': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'let': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']}, '': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]}
What I think is especially cool is that this even works with functions:
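The original chunks are not shown. A small sketch of the idea from the R side, guarded so that it only runs when reticulate and a Python installation are available (add_one is a made-up function name):

```r
if (requireNamespace("reticulate", quietly = TRUE) &&
    reticulate::py_available(initialize = TRUE)) {
  # Define a Python function from within R
  reticulate::py_run_string("def add_one(x):\n    return x + 1")

  # The function is exposed through the py object and can be
  # called like a regular R function
  result <- reticulate::py$add_one(1)
  result
}
```

In a Quarto document you would normally define the function in a {python} chunk instead; py_run_string() is used here only to keep the sketch in a single R block.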
You did not come to class to just scrape exercise pages. You probably had some initial data and/or research question in mind. Please write a short abstract (~200-400 words) on what you want to accomplish with the web scraping skills you will learn here, so we can try to incorporate the necessary tools in one of the sessions this week. The abstract should include what data can be found on the website and what potential research questions you have in mind.
Deadline: Tuesday midnight
Save some information about the session for reproducibility.
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: EndeavourOS
Matrix products: default
BLAS: /usr/lib/libblas.so.3.11.0
LAPACK: /usr/lib/liblapack.so.3.11.0
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=nl_NL.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Amsterdam
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reticulate_1.30 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[5] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[9] ggplot2_3.4.2 tidyverse_2.0.0 dplyr_1.1.2
loaded via a namespace (and not attached):
[1] Matrix_1.6-0 gtable_0.3.3 jsonlite_1.8.7 compiler_4.3.1
[5] Rcpp_1.0.11 tidyselect_1.2.0 png_0.1-8 scales_1.2.1
[9] yaml_2.3.7 fastmap_1.1.1 lattice_0.21-8 R6_2.5.1
[13] generics_0.1.3 knitr_1.43 munsell_0.5.0 pillar_1.9.0
[17] tzdb_0.4.0 rlang_1.1.1 utf8_1.2.3 stringi_1.7.12
[21] xfun_0.39 timechange_0.2.0 cli_3.6.1 withr_2.5.0
[25] magrittr_2.0.3 digest_0.6.33 grid_4.3.1 rstudioapi_0.15.0
[29] rappdirs_0.3.3 hms_1.1.3 lifecycle_1.0.3 vctrs_0.6.3
[33] evaluate_0.21 glue_1.6.2 codetools_0.2-19 fansi_1.0.4
[37] colorspace_2.1-0 rmarkdown_2.23 tools_4.3.1 pkgconfig_2.0.3
[41] htmltools_0.5.5